Abstract. We address a problem of identifying nodes having a high
centrality  value  in  a  large  social  network  based  on  its  approximation 
derived  only  from  nodes  sampled  from  the  network.  More  specifically, 
we  detect  gaps  between  nodes  with  a  given  confidence  level,  assuming 
that  we  can  say  a  gap  exists  between  two  adjacent  nodes  ordered  in 
descending  order  of  approximations  of  true  centrality  values  if  it  can 
divide  the  ordered  list  of  nodes  into  two  groups  so  that  any  node  in  one 
group  has  a  higher  centrality  value  than  any  one  in  another  group  with 
a  given  confidence  level.  To  this  end,  we  incorporate  confidence  intervals 
of  true  centrality  values,  and  apply  the  resampling-based  framework  to 
estimate  the  intervals  as  accurately  as  possible.  Furthermore,  we  devise 
an  algorithm  that  can  efficiently  detect  gaps  by  making  only  two  passes 
through  the  nodes,  and  empirically  show,  using  three  real  world  social 
networks,  that  the  proposed  method  can  successfully  detect  more  gaps, 
compared  to  the  one  adopting  a  standard  error  estimation  framework, 
using  the  same  node  coverage  ratio,  and  that  the  resulting  gaps  enable 
us  to  correctly  identify  a  set  of  nodes  having  a  high  centrality  value. 

1 Introduction

Recently, social media such as Facebook, Digg, and Twitter have become extremely
popular communication tools on a global scale and have generated large-scale social
networks on the web. Such networks allow us to share a wide variety of topics
that have been posted on social media because those topics can rapidly and
widely spread through the networks. Thus, in recent years, social media has played
an important role as information infrastructure, and the social networks constructed
on it have been extensively investigated from various angles [4,8].

In such social network analysis, we can gain insight into some features of
a given network by using node centrality [1,3,5,7,14], which characterizes
nodes in the network based on its topology. Typical measures include the degree,
closeness, and betweenness centralities. Some of them, such as the degree
centrality, are based only on information about the neighboring nodes of a target
node, while others also depend on the global structure of the network. For example,
to compute the betweenness centrality, we have to enumerate shortest paths between
arbitrary node pairs, which is computationally very expensive. Since a social
network on the web can easily grow in size, it is crucial to compute the values of
such a centrality efficiently in order to analyze a large network.

To address this scalability problem, sampling-based approaches have been
proposed [6,10,11], which investigate sampling methods that can obtain
better approximations of true centrality values. Those methods are roughly
categorized into uniform sampling, non-uniform sampling, and traversal/walk-based
sampling. In contrast, we proposed a framework that ensures the accuracy
of the approximations under uniform sampling [13], in which we estimated
the approximation error, referred to as the resampling error, by considering all
possible partial networks of a fixed size that are generated by resampling nodes
according to a given coverage ratio, together with the approximated centrality
values derived from them. It was empirically shown that the resampling-based
framework provides a tighter approximation error with a higher confidence level
than the traditional standard error in statistics under a given sampling ratio.

Unlike  these  existing  approaches,  in  this  paper,  we  consider  detecting  a  set 
of  nodes  having  a  high  centrality  value  only  from  approximations  derived  from 
sampled  nodes  with  an  adequate  confidence  level,  instead  of  trying  to  accurately 
estimate  the  centrality  value  itself.  We  are  interested  in  such  nodes  because 
they  tend  to  play  an  important  role  for  information  diffusion  on  the  network. 
To  this  end,  we  consider  a  list  of  nodes  in  descending  order  of  the  approximate 
centrality  value,  and  devise  an  algorithm  to  efficiently  detect  gaps  that  exist 
between  two  adjacent  nodes  in  the  list.  Here,  we  say  a  gap,  or  a  boundary 
exists  between  two  adjacent  nodes  in  the  list  if  it  can  divide  the  ordered  list  of 
nodes  into  two  groups  so  that  any  node  belonging  to  one  group  has  a  higher 
centrality  value  than  any  node  in  another  group  with  a  given  confidence  level. 
We  incorporate  confidence  intervals  of  true  centrality  values  for  each  node  to 
detect  such  gaps,  and  adopt  the  above  resampling-based  estimation  framework 
to  estimate  the  confidence  intervals  as  accurately  as  possible.  The  results  of 
extensive  experiments  on  three  real  world  social  networks  demonstrate  that  using 
the  resampling  error  for  detecting  gaps  outperforms  using  the  standard  error  in 
terms  of  the  number  of  gaps  detected,  and  that  the  resulting  gaps  allow  us  to 
correctly  identify  nodes  having  a  high  centrality  value. 

2 Resampling-Based Estimation Framework

In this section, following [13], we revisit the resampling-based framework
for estimating an approximation error with a given confidence level and its
application to computing the node centrality.


2.1  General  Framework 

Let $S$ be a set of objects such that $|S| = L$, and $f$ a function that assigns a
value to each object $s \in S$. Then, the problem we address is estimating the
average $\mu$ over the set of entire values $\{f(s) \mid s \in S\}$ only from its
arbitrary subset of partial values $\{f(t) \mid t \in T \subset S\}$. Let $\mu(T)$
be the partial average over a subset $T$ whose number of elements is $N$, i.e.,
$\mu(T) = (1/N) \sum_{t \in T} f(t)$. Then, we consider using this partial average
$\mu(T)$ as an approximate solution of the true average $\mu$ and estimating an
expected approximation error $RE(N)$, referred to as the resampling error, which is
the difference between $\mu$ and $\mu(T)$, with respect to the number of elements
$N$, if $L$ is too large to compute $\mu$. Given $\mathcal{T} \subset 2^S$ that is
a family of subsets of $S$ such that $|T| = N$ for $T \in \mathcal{T}$, the
resampling error $RE(N)$ is defined as follows:

$$
RE(N) = \sqrt{\left\langle (\mu - \mu(T))^2 \right\rangle_{T \in \mathcal{T}}}
      = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{T \in \mathcal{T}} \left( \mu - \frac{1}{N} \sum_{t \in T} f(t) \right)^2}
      = C(N)\,\sigma, \tag{1}
$$

where the factor $C(N) = \sqrt{(L - N) / ((L - 1)N)}$ and
$\sigma = \sqrt{(1/L) \sum_{s \in S} (f(s) - \mu)^2}$ is the standard deviation.
Note that since the estimation error of Equation (1) is regarded as the standard
deviation with respect to the number of elements $N$, we can claim from a
statistical viewpoint that for a given subset $T$ such that $|T| = N$, and its
partial average value $\mu(T)$, the probability that $|\mu(T) - \mu|$ is larger
than $1.96 \times RE(N)$ is less than 5%. In other words, the range
$\mu(T) \pm 1.96 \times RE(N)$ is regarded as the 95% confidence interval of $\mu$.

On the other hand, we can consider a standard approach to this problem that
is based on the i.i.d. (independent and identically distributed) assumption. More
specifically, for a given subset $T$ that has $N$ elements, that is,
$T = \{t_1, \cdots, t_N\}$, it is assumed that each element $t \in T$ is
independently selected according to some distribution $p(t)$ such as an empirical
distribution $p(t) = 1/L$. Then, the standard error $SE(N)$ based on this
assumption is defined as follows:

$$
SE(N) = \sqrt{\left\langle (\mu - \mu(T))^2 \right\rangle}
      = \sqrt{\sum_{t_1 \in S} \cdots \sum_{t_N \in S} \left( \mu - \frac{1}{N} \sum_{n=1}^{N} f(t_n) \right)^2 \prod_{n=1}^{N} p(t_n)}
      = D(N)\,\sigma, \tag{2}
$$

where $D(N) = 1/\sqrt{N}$ and $\sigma$ is the standard deviation.

Note that Equations (1) and (2) differ only in their coefficient terms, $C(N)$
and $D(N)$, and that $C(N) < D(N)$ for $N > 1$, $C(L) = 0$, and $D(L) \neq 0$.
Namely, $RE(N) < SE(N)$ for any $N > 1$, and $RE(N)$ becomes 0 when $N = L$,
whereas $SE(N)$ does not. Note also that the true standard deviation $\sigma$ is
needed in both Equations (1) and (2). In practice, when $|S|$ is too large to
compute $\sigma$, which is exactly the case where sampling is needed, we can use
instead the standard deviation $\sigma'$ derived from a subset $S' \subset S$ such
that $|S'| = L'$ is small enough to compute $\sigma'$ within a reasonable time.
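The relationship between the two coefficients can be checked numerically. The following is a minimal sketch, not the implementation used in [13]; the function names `resampling_coeff`, `standard_coeff`, and `empirical_error` are hypothetical. It evaluates $C(N)$ and $D(N)$ and compares $C(N)\sigma$ with a Monte Carlo estimate of the root-mean-square error of partial averages drawn without replacement.

```python
import math
import random
import statistics


def resampling_coeff(L: int, N: int) -> float:
    """C(N) = sqrt((L - N) / ((L - 1) * N)) from Equation (1)."""
    return math.sqrt((L - N) / ((L - 1) * N))


def standard_coeff(N: int) -> float:
    """D(N) = 1 / sqrt(N) from Equation (2)."""
    return 1.0 / math.sqrt(N)


def empirical_error(values, N, trials=5000, seed=0):
    """Monte Carlo estimate of the RMS difference between the full average
    and the average of N values sampled without replacement."""
    rng = random.Random(seed)
    mu = statistics.fmean(values)
    total = 0.0
    for _ in range(trials):
        subset = rng.sample(values, N)  # sampling without replacement
        total += (mu - statistics.fmean(subset)) ** 2
    return math.sqrt(total / trials)
```

For a list `values` of length $L$, `resampling_coeff(L, N) * statistics.pstdev(values)` gives $RE(N)$ and `standard_coeff(N) * statistics.pstdev(values)` gives $SE(N)$; the Monte Carlo estimate should track the former, including the drop to 0 at $N = L$.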


2.2  Application  to  Node  Centrality  Estimation 


Next, we present the way to apply the above estimation framework to node
centrality estimation of a social network that is represented as a directed graph
$G = (V, E)$, where $V$ and $E$ ($\subset V \times V$) are the sets of all the
nodes and the links in the network, respectively. Here, we consider two node
centrality measures, the closeness centrality and the betweenness centrality,
as in [13].

The closeness $cls_G(u)$ of a node $u$ on a graph $G$ is defined as

$$
cls_G(u) = \frac{1}{|V| - 1} \sum_{v \in V,\, v \neq u} \frac{1}{spl_G(u, v)}, \tag{3}
$$

where $spl_G(u, v)$ stands for the shortest path length from $u$ to $v$ in $G$, and
we set $spl_G(u, v) = \infty$ when node $v$ is unreachable from node $u$ on $G$.
Intuitively, a node $u$ has a high value for this closeness centrality if a large
number of nodes are reachable from $u$ within relatively short path lengths. A
standard technique for computing $cls_G(u)$ of each node $u \in V$ is the burning
algorithm [12], whose computational complexity is $O(|E|)$. Thus, it takes a large
amount of computation time for a huge social network consisting of millions of
nodes. To apply the above estimation framework to the computation of an
approximation of the closeness centrality $cls_G(u)$ of each node $u \in V$, we
instantiate the set of objects $S$ and the function $f$ for this problem. In fact,
we consider $S_u = V \setminus \{u\}$ as the set $S$ and
$f_u(v) = 1 / spl_G(u, v)$ as the function $f$, and thereby can calculate a partial
average value $cls_G(u; T)$ from an arbitrary subset $T \subset S_u \cup \{u\}$ and
its approximation error, $RE(u; |T|)$ and $SE(u; |T|)$, according to the above
framework.
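As an illustration of this instantiation, the partial average $cls_G(u; T)$ can be obtained for every node $u$ at once by running one breadth-first search per sampled node over the reversed links. The sketch below is one possible realization under that assumption, not the authors' code; `reverse_bfs_lengths` and `approx_closeness` are hypothetical names, and the graph is assumed to be unweighted and given as a list of directed `(u, v)` pairs.

```python
from collections import deque


def reverse_bfs_lengths(rev_adj, target):
    """Shortest path lengths spl(u, target) for every u that can reach
    target, found by BFS over reversed links."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        x = queue.popleft()
        for y in rev_adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def approx_closeness(nodes, edges, sample):
    """Partial average of f_u(v) = 1 / spl(u, v) over v in T \\ {u}.

    Unreachable pairs contribute 0, matching spl = infinity.
    """
    rev_adj = {}
    for u, v in edges:
        rev_adj.setdefault(v, []).append(u)
    sample_set = set(sample)
    total = {u: 0.0 for u in nodes}
    for t in sample_set:
        for u, d in reverse_bfs_lengths(rev_adj, t).items():
            if u != t:
                total[u] += 1.0 / d
    estimate = {}
    for u in nodes:
        n_u = len(sample_set) - (1 if u in sample_set else 0)
        # With no usable sampled target the estimate is reported as 0.0
        # (a convention chosen for this sketch).
        estimate[u] = total[u] / n_u if n_u > 0 else 0.0
    return estimate
```

When `sample` contains every node, the result coincides with Equation (3).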

Next, the betweenness $btw_G(u)$ of a node $u$ on a graph $G$ is defined as

$$
btw_G(u) = \frac{1}{(|V| - 1)(|V| - 2)} \sum_{v \in V,\, v \neq u}
\left( \sum_{w \in V,\, w \neq u,\, w \neq v} \frac{nsp_G(v, w; u)}{nsp_G(v, w)} \right), \tag{4}
$$

where $nsp_G(v, w)$ is the number of the shortest paths from $v$ to $w$ in $G$, and
$nsp_G(v, w; u)$ is the number of the shortest paths from $v$ to $w$ that pass
through node $u$. Thus, the betweenness of a node $u$ becomes high if a large
number of shortest paths between two nodes pass through node $u$. The Brandes
algorithm [2] is a standard technique for computing $btw_G(u)$ of each node
$u \in V$, and its computational complexity is $O(|E|)$. Thus, it also requires a
large amount of computation time for a large social network. Again, we consider
instantiating $S$ and $f$ of the above estimation framework for computing an
approximation of the betweenness centrality $btw_G(u)$. More specifically, we
regard the expression inside the large parentheses in Equation (4) as a function
$btw_G(u; v)$, the betweenness of node $u$ that restricts its starting node to $v$.
Then, by considering $S_u = V \setminus \{u\}$ and
$f_u(v) = btw_G(u; v) / (|V| - 2)$, we can calculate a partial average value
$btw_G(u; T)$ from an arbitrary subset $T \subset S_u \cup \{u\}$ and its
estimation error, $RE(u; |T|)$ and $SE(u; |T|)$, based on the above estimation
framework.
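The quantity $btw_G(u; v)$ is the per-source dependency that Brandes-style accumulation produces, so the partial average can be sketched as follows. This is an illustrative sketch under stated assumptions (an unweighted simple directed graph with more than two nodes, given as an adjacency dictionary), and the names `source_dependency` and `approx_betweenness` are hypothetical rather than taken from [2] or [13].

```python
from collections import deque


def source_dependency(adj, s):
    """For a fixed start node s, return delta[u] = sum over w of
    nsp(s, w; u) / nsp(s, w), for every reachable u other than s."""
    n_paths = {s: 1}  # number of shortest paths from s
    dist = {s: 0}
    preds = {s: []}
    order = []  # nodes in non-decreasing distance from s
    queue = deque([s])
    while queue:
        x = queue.popleft()
        order.append(x)
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                n_paths[y] = 0
                preds[y] = []
                queue.append(y)
            if dist[y] == dist[x] + 1:
                n_paths[y] += n_paths[x]
                preds[y].append(x)
    delta = {x: 0.0 for x in order}
    for y in reversed(order):  # accumulate from the farthest nodes back
        for x in preds[y]:
            delta[x] += n_paths[x] / n_paths[y] * (1.0 + delta[y])
    delta.pop(s)
    return delta


def approx_betweenness(nodes, adj, sample):
    """Partial average of f_u(v) = btw(u; v) / (|V| - 2) over v in T \\ {u}.
    Assumes len(nodes) > 2."""
    n = len(nodes)
    sample_set = set(sample)
    total = {u: 0.0 for u in nodes}
    for v in sample_set:
        for u, d in source_dependency(adj, v).items():
            total[u] += d / (n - 2)
    estimate = {}
    for u in nodes:
        n_u = len(sample_set) - (1 if u in sample_set else 0)
        estimate[u] = total[u] / n_u if n_u > 0 else 0.0
    return estimate
```

With `sample` equal to the full node set, the estimate reduces to Equation (4).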

3  Gap  Detection  Method 

In this section, we consider the way to detect a set of nodes having a high
centrality value with a given confidence level based only on centrality values
estimated from a subset of nodes in a network. First of all, we formally define the
problem we address here. For a network $G(V, E)$, let $\mu_G(v)$ be the true value
of a certain centrality measure for node $v \in V$, $\mu_G(v; T)$ be its estimation
derived only from a subset of nodes $T \subset V$, and $\sigma(v; |T|)$ be its
approximation error such as $RE(v; |T|)$ and $SE(v; |T|)$. In addition, given a
node $v$, let $V_H(v; T) = \{u \in V;\ \mu_G(u; T) \geq \mu_G(v; T)\}$ and
$V_L(v; T) = \{w \in V;\ \mu_G(w; T) < \mu_G(v; T)\}$ be disjoint partitions of $V$
with respect to $\mu_G(v; T)$. Then, incorporating the confidence interval
estimation in statistics, the problem can be defined as finding all nodes
$v \in V$ that satisfy the following inequality for $\forall u \in V_H(v; T)$ and
$\forall w \in V_L(v; T)$:

$$
\mu_G(u; T) - z(\alpha) \cdot \sigma(u; |T|) > \mu_G(w; T) + z(\alpha) \cdot \sigma(w; |T|), \tag{5}
$$

where $0 < \alpha < 1$ and $z(\alpha)$ is the upper $\alpha/2$ critical value of
the standard normal distribution. In other words, $\mu_G(u) > \mu_G(w)$ holds for
$\forall u \in V_H(v; T)$ and $\forall w \in V_L(v; T)$ with the confidence level
$C = 100(1 - \alpha)\%$. Here, the upper half set $V_H(v; T)$ is a set that we want
to identify, and we say that a gap exists between $v$ and
$v' \in \arg\max_{w \in V_L(v; T)} \mu_G(w; T)$. It is obvious that a
straightforward approach to this problem requires the computational complexity of
$O(|V|^3)$ because it has to check $|V_H(v; T)|\,|V_L(v; T)|$ pairs of nodes for
each $v$, which is not acceptable when a given social network is very large.

To cope with this, we first consider a lower error bound of $V_H(v; T)$ and an
upper error bound of $V_L(v; T)$, respectively defined as
$LB(V_H(v; T); \alpha) = \min_{u \in V_H(v; T)} (\mu_G(u; T) - z(\alpha)\sigma(u; |T|))$
and
$UB(V_L(v; T); \alpha) = \max_{w \in V_L(v; T)} (\mu_G(w; T) + z(\alpha)\sigma(w; |T|))$.
Hereafter, for simplicity, $LB(V_H(v; T); \alpha)$ and $UB(V_L(v; T); \alpha)$ are
denoted by $LB(V_H(v); T, \alpha)$ and $UB(V_L(v); T, \alpha)$, respectively. Then,
we focus on the fact that the above problem is reduced to finding all nodes
$v \in V$ that satisfy the relation $LB(V_H(v); T, \alpha) > UB(V_L(v); T, \alpha)$
for given $\alpha$. Since both $LB(V_H(v); T, \alpha)$ and $UB(V_L(v); T, \alpha)$
can be simultaneously computed for arbitrary $v \in V$ by making only one pass
through $V$, the total computational complexity becomes $O(|V|^2)$, which is
smaller than $O(|V|^3)$, but it is still hard to find all such nodes when the
network gets larger.

Thus, we further consider an ordered list $(v_1, v_2, \cdots, v_{|V|})$ of nodes
in $V$ that results from sorting them in descending order of the value of
$\mu_G(v; T)$, i.e., $\mu_G(v_i; T) \geq \mu_G(v_{i+1}; T)$ for
$i \in \{1, \cdots, |V| - 1\}$. Then, $LB(V_H(v_k); T, \alpha)$ is recursively
defined as
$LB(V_H(v_k); T, \alpha) = \min(LB(V_H(v_{k-1}); T, \alpha),\ \mu_G(v_k; T) - z(\alpha)\sigma(v_k; |T|))$.
Likewise, $UB(V_L(v_k); T, \alpha)$ is defined as
$UB(V_L(v_k); T, \alpha) = \max(UB(V_L(v_{k+1}); T, \alpha),\ \mu_G(v_{k+1}; T) + z(\alpha)\sigma(v_{k+1}; |T|))$.
Considering these definitions, we can compute $LB(V_H(v); T, \alpha)$ and
$UB(V_L(v); T, \alpha)$ for every node $v \in V$ by making only one pass each
through the list $(v_1, v_2, \cdots, v_{|V|})$, which implies that we can detect
all gaps by making two passes through the ordered list. More specifically, in the
first pass, referred to as the forward step, we compute $LB(V_H(v_k); T, \alpha)$
varying $k$ from 1 to $|V| - 1$, and then, in the second pass, called the backward
step, we compute $UB(V_L(v_k); T, \alpha)$ and detect a gap if
$LB(V_H(v_k); T, \alpha) > UB(V_L(v_k); T, \alpha)$ holds, varying $k$ from $|V|$
to 2. The computational complexity of this method is governed by that of its
sorting process, and thus becomes $O(|V| \log |V|)$, which enables practical gap
analysis even for a large social network. The procedure is summarized as follows:

1. $A \leftarrow \emptyset$, $LB(V_H(v_1); T, \alpha) = \mu_G(v_1; T) - z(\alpha)\sigma(v_1; |T|)$, and
   $UB(V_L(v_{|V|}); T, \alpha) = 0$;

2. (Forward step) For $k = 2$ to $|V| - 1$,
   $LB(V_H(v_k); T, \alpha) = \min(LB(V_H(v_{k-1}); T, \alpha),\ \mu_G(v_k; T) - z(\alpha)\sigma(v_k; |T|))$;

3. (Backward step) For $k = |V| - 1$ to 2,

   (a) $UB(V_L(v_k); T, \alpha) = \max(UB(V_L(v_{k+1}); T, \alpha),\ \mu_G(v_{k+1}; T) + z(\alpha)\sigma(v_{k+1}; |T|))$;

   (b) $A \leftarrow A \cup \{v_k\}$ if $LB(V_H(v_k); T, \alpha) > UB(V_L(v_k); T, \alpha)$;

4. Output $A$, and terminate.
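The two-pass procedure can be sketched in a few lines. The code below is a minimal illustration rather than the authors' implementation: `detect_gaps` is a hypothetical name, the loops cover every adjacent pair $(v_k, v_{k+1})$, and the upper bound of the empty lower set is initialized to negative infinity, which is a convention chosen for this sketch.

```python
import math


def detect_gaps(estimates, errors, z=1.96):
    """Return (order, gaps): nodes sorted by descending estimate and the
    1-based positions k at which a gap separates v_1..v_k from the rest.

    estimates[v] is the approximate centrality of node v; errors[v] is its
    estimated error (0 for the naive method, SE or RE otherwise).
    """
    order = sorted(estimates, key=estimates.get, reverse=True)
    n = len(order)
    lower = [estimates[v] - z * errors[v] for v in order]
    upper = [estimates[v] + z * errors[v] for v in order]

    # Forward step: lb[i] is the smallest lower bound among order[0..i].
    lb = []
    running = math.inf
    for i in range(n):
        running = min(running, lower[i])
        lb.append(running)

    # Backward step: ub is the largest upper bound among order[i+1..n-1].
    gaps = []
    ub = -math.inf
    for i in range(n - 2, -1, -1):
        ub = max(ub, upper[i + 1])
        if lb[i] > ub:
            gaps.append(i + 1)
    gaps.reverse()
    return order, gaps
```

Passing all-zero errors reproduces the naive behavior of reporting a gap wherever two adjacent estimates differ; apart from the initial sort, the work is linear in the number of nodes.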

We consider three kinds of methods by adopting different definitions of the
estimated error $\sigma(v; |T|)$, which are $\sigma(v; |T|) = 0$,
$\sigma(v; |T|) = SE(v; |T|)$, and $\sigma(v; |T|) = RE(v; |T|)$. We refer to these
methods as the naive, SE, and RE methods, respectively. Note that the naive method
assumes $\mu_G(v; T) = \mu_G(v)$. Thus, it determines that there exists a gap
between nodes $v_k$ and $v_{k+1}$ for every $k$ such that
$\mu_G(v_k; T) \neq \mu_G(v_{k+1}; T)$. On the other hand, since $SE(v; |T|)$
overestimates the approximation error of $\mu_G(v; T)$ compared to $RE(v; |T|)$,
the number of gaps detected by the SE method becomes less than that by the RE
method. For more details, we empirically compare these methods through experiments
on real world social networks as described below.

4  Experiments 

4.1  Datasets 

We empirically evaluated the three gap detection methods described in the
previous section on three datasets of real world networks that are represented as
directed graphs. The first dataset is a network extracted from a Japanese blog
service site “Ameba”¹, which has 56,604 nodes representing blogs in “Ameba”
and 734,737 directed links among them. Each directed link is constructed from
blog u to blog v if blog u is registered as a favorite one in blog v. We refer to
this network as the Ameblo network. The second one is a network extracted from a
Japanese word-of-mouth communication site for cosmetics, “@cosme”², consisting
of 45,024 nodes representing its users and 351,299 directed links, in which
a link (u, v) means that user v registers user u as her favorite one. We refer to
this directed network as the Cosme network. The last one is a network derived
from the Enron Email Dataset [9], which has 19,603 nodes and 210,950 links. In
this network, a node is an email address that appears in the dataset as either a
sender or a recipient, while a directional link (u, v) between two email addresses
u and v means that u sent an email to v. We refer to this directed network as
the Enron network. These three networks are not huge, but they are large enough
to investigate the basic performance of the three methods from various angles.
We thus simply use the standard deviation $\sigma$ derived from $S$ to compute the
resampling and standard errors.

¹ http://www.ameba.jp/

Fig. 1. Centrality values and their standard deviations of the top 1,000 nodes in
descending order of the true value of each centrality in the Ameblo, Cosme, and Enron
networks

Figures 1(a) to 1(c) show the top 1,000 nodes in descending order of the true
value of the closeness centrality in the Ameblo, Cosme, and Enron networks,
respectively, while Figures 1(d) to 1(f) show the top 1,000 nodes in descending
order of the true value of the betweenness centrality for the same three networks.
We only plotted the top 1,000 nodes because we are interested in nodes
having high centrality values. In each figure, the horizontal axis indicates the
values of the corresponding centrality, and the vertical axis shows its standard
deviation defined as
$\sigma_{\mu_G}(u) = \sqrt{(|V| - 1)^{-1} \sum_{v \in V,\, v \neq u} (f_u(v) - \mu_G(u))^2}$,
where $\mu_G(u)$ stands for either $cls_G(u)$ or $btw_G(u)$, and $f_u(v)$ is
$1 / spl_G(u, v)$ for $cls_G(u)$ and $btw_G(u; v) / (|V| - 2)$ for $btw_G(u)$.
From these figures, we can observe that higher-ranked nodes in each centrality
measure are distinguishable from each other in every network because of their
distinctive values of the centrality, while it looks hard to do the same for
lower-ranked nodes. This tendency can be found more clearly in the plots for the
betweenness centrality, in which nodes are scattered over a larger area. From
these observations, we can expect that it is harder to detect gaps that exist
between lower-ranked nodes than gaps between higher-ranked nodes, and that more
gaps can be detected for the betweenness centrality than for the closeness
centrality.

4.2  Results 

We applied the naive, SE, and RE methods to the three networks mentioned
above for the closeness and betweenness centralities, and examined the number
of gaps they detected and how many gaps among them were correct. A correct
gap is one for which the resulting upper half set $V_H(v_k; T)$ corresponds exactly
to the true upper half set, that is, the set of the top $k$ nodes in descending
order of the true centrality value. In this experiment, we adopted the confidence
level of 95% ($\alpha = 0.05$) as a typical one and fixed it, while we varied the
coverage $|T|/|V|$ from 0.01 to 1.00 in steps of 0.01 to see how the number of
gaps detected changes according to the coverage. More precisely, we randomly
sampled nodes from $V$ without replacement, added them to the subset $T$ one by
one, and counted the number of gaps detected and the number of gaps correctly
detected each time the coverage increased by 0.01. Since we are interested in nodes
having a high centrality value, we considered only the top $K$ nodes in descending
order of the estimated value of the corresponding centrality at each coverage. We
repeated this process $R = 1,000$ times and computed the average over them.

Figure 2 shows the results for the closeness centrality in the case of $K = 100$.
The horizontal axis means the coverage, and the vertical axis means the number
of gaps. The blue solid line and the red broken line represent the number of
gaps detected and the number of gaps incorrectly detected by the corresponding
method, respectively, which are defined as follows:

$$
(\#\text{ of gaps detected}) = \frac{1}{R} \sum_{r=1}^{R} \frac{|A(c, r)|}{|A_{nv}(c, r)|} \times K, \tag{6}
$$

$$
(\#\text{ of gaps incorrectly detected}) = \frac{1}{R} \sum_{r=1}^{R} \frac{|A(c, r) \setminus A^*(c, r)|}{|A_{nv}(c, r)|} \times K, \tag{7}
$$

where $A(c, r)$ is the set of nodes corresponding to gaps, i.e., $A$ in the
algorithm in Section 3, detected by the respective method at coverage $c$ in the
$r$-th iteration, while $A^*(c, r)$ is the set of nodes correctly detected among
them. It is noted that since some of the top $K$ nodes may have the same
estimation, these numbers are normalized by the number of gaps detected by the
naive method, $|A_{nv}(c, r)|$, which corresponds to the number of node pairs $v_i$
and $v_{i+1}$ having different estimations. Thus, the blue solid line for the naive
method always exhibits the best performance ($= K$).

Fig. 2. Fluctuation of the number of gaps detected by the naive, SE, and RE methods
as a function of the coverage for the top K = 100 nodes in descending order of the
estimated value of the closeness centrality in the Ameblo, Cosme, and Enron networks
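The correctness criterion and the normalization by the naive count can be illustrated as follows. This is a sketch under stated assumptions: `is_correct_gap` and `normalized_counts` are hypothetical names, each trial is summarized by a tuple of (detected, correctly detected, naive) gap counts, the incorrect count is taken as detected minus correct, and a trial in which the naive method finds no gap contributes 0, which is a convention chosen here.

```python
def is_correct_gap(est_order, true_order, k):
    """A gap after position k is correct when the estimated top-k set
    equals the true top-k set."""
    return set(est_order[:k]) == set(true_order[:k])


def normalized_counts(runs, K):
    """Average normalized gap counts over trials.

    runs is a list of (n_detected, n_correct, n_naive) tuples, one per
    trial. Returns (detected, incorrectly_detected), each scaled by K.
    """
    R = len(runs)
    detected = 0.0
    incorrect = 0.0
    for n_detected, n_correct, n_naive in runs:
        if n_naive == 0:
            continue  # no distinguishable adjacent pair in this trial
        detected += n_detected / n_naive
        incorrect += (n_detected - n_correct) / n_naive
    return detected / R * K, incorrect / R * K
```

For the naive method the detected and naive counts coincide in every trial, so the first returned value equals $K$ whenever every trial has at least one distinguishable pair.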

From these results, it is found that although the number of gaps incorrectly
detected by the naive method decreases as the coverage becomes larger, it is
much larger than the numbers for the other two methods, which are almost exactly
0. In contrast, the number of gaps detected by either the SE or the RE method is
very small compared to that of the naive method. In particular, the number of
gaps detected by the SE method increases only slightly even when the coverage
approaches 1.0. On the other hand, the number of gaps detected by the
RE method is slightly larger than that of the SE method while the coverage
is small, but it rapidly increases at around $c = 0.9$ and finally becomes 100,
while the number of gaps incorrectly detected remains almost 0. This difference
comes from the fact that the resampling error $RE(v; |T|)$ converges to 0
as $|T|$ approaches $|V|$, while the standard error $SE(v; |T|)$ does not. These
tendencies are also observed in the results for the betweenness centrality shown
in Fig. 3.

Fig. 3. Fluctuation of the number of gaps detected by the naive, SE, and RE methods
as a function of the coverage for the top K = 100 nodes in descending order of
the estimated value of the betweenness centrality in the Ameblo, Cosme, and Enron
networks

Next, we examined the cases of $K = 10$ and $K = 1,000$. Due to the page
limitation, we show only the results for the Ameblo network here, but we
observed the same tendencies for the others. Figures 4 and 5 show the results
for the closeness centrality and for the betweenness centrality, respectively. From
Figs. 4(a) and 5(a), the number of gaps incorrectly detected by the naive method
is relatively small compared to the results for $K = 100$, although it is still
larger than the numbers for the other methods, which are almost 0 in this case,
too. This is because the higher-ranked nodes in the true centrality value are
distinguishable, as shown in Fig. 1. For the same reason, the number of gaps
detected by either the SE or the RE method is relatively large compared to the
case of $K = 100$. By comparing Figs. 4(b) and 4(c) for the closeness centrality,
and Figs. 5(b) and 5(c) for the betweenness centrality, it is found more clearly
that the RE method can correctly detect more gaps than the SE method does at the
same coverage. Furthermore, as expected above, by comparing Figs. 4(b)
and 5(b), we can observe that the number of gaps detected by the SE method
for the betweenness centrality is larger than that for the closeness centrality.
A similar tendency can be observed for the RE method from Figs. 4(c) and
5(c). On the other hand, we can observe from the results for $K = 1,000$ that the
number of gaps incorrectly detected by the naive method is relatively large, and
the number of gaps detected by the other methods is relatively small, compared
to the other results. This result supports our expectation that it is harder
to correctly detect gaps that exist between lower-ranked nodes.

Fig. 4. Fluctuation of the number of gaps detected by the naive, SE, and RE methods
as a function of the coverage for the top K = 10 and K = 1,000 nodes in descending
order of the estimated value of the closeness centrality in the Ameblo network

Fig. 5. Fluctuation of the number of gaps detected by the naive, SE, and RE methods
as a function of the coverage for the top K = 10 and K = 1,000 nodes in descending
order of the estimated value of the betweenness centrality in the Ameblo network

To summarize the above results, the naive method is not reliable for a large
K. It can detect many gaps correctly for a small K, say 10, but it detects
incorrect gaps if the coverage is low. This is not desirable as a means to reduce
the  computational  cost  for  detecting  nodes  having  a  high  centrality  value.  On 
the  other  hand,  the  SE  and  RE  methods  satisfactorily  detect  gaps  correctly 
regardless  of  the  value  of  coverage.  The  SE  method  is  more  conservative  by 
overestimating  the  error  margin  and  less  useful  than  the  RE  method  in  terms 
of  the  number  of  gaps  detected  at  the  same  coverage.  Note  that  although  the 
number  of  gaps  detected  by  the  RE  method  is  limited  for  a  low  coverage,  the 
resulting  gaps  are  more  likely  to  appear  between  nodes  having  a  high  centrality 
value,  which  is  desirable  for  us  to  detect  important  nodes  in  a  network. 

5  Conclusion 

In this paper, we addressed a problem of identifying nodes having a high
centrality value in a social network based only on its approximation derived from
a limited number of sampled nodes. To this end, we focused on confidence intervals
of the true centrality value for each node, and considered detecting gaps that
divide a set of nodes into two groups so that any node in one group has a higher
centrality value than any node in the other with a given confidence level. To
estimate confidence intervals as accurately as possible, we employed the
resampling-based framework for estimation of the approximation error, and devised
an algorithm that can efficiently detect gaps, whose computational complexity is
$O(|V| \log |V|)$ for the number of nodes in a network, $|V|$, which is much less
than the $O(|V|^3)$ of the straightforward approach. Through extensive experiments
on three real world social networks for the closeness and betweenness centralities,
we empirically confirmed that the proposed method can correctly detect gaps that
exist between high-ranked nodes with the confidence level of 95% even for a partial
network whose coverage is small, say 0.2, and can detect more gaps than the
method that uses the standard error to estimate confidence intervals at the
same coverage ratio. In particular, the ratio of gaps incorrectly detected to the
total number of detected gaps is almost 0 for both methods. It is noted that the
method we proposed is not specific to identification of nodes having a high
centrality value, but is also applicable to any other estimation problem to which
the resampling-based estimation framework is applicable. We believe that the
conclusions obtained in this paper can generalize, but we have yet to test the
proposed method in a broader setting and in different domains.